# Medicaid MCO Procurement Accountability Claims Analysis Dataset

## Dataset Overview

This dataset contains 372,283 thematic accountability claims extracted from Medicaid managed care organization (MCO) procurement documents across 32 US states, spanning 2017-2024. The analysis supports the manuscript "Evaluating Medicaid Managed Care Organization Accountability: Large Language Model Analysis of RFP Response Claims Across 32 US States."

## Citation

Basu S, Fleming A, Morgan J, Batniji R. Medicaid MCO Procurement Accountability Claims Analysis Dataset. Harvard Dataverse. 2025. https://doi.org/10.7910/DVN/6EFL00

## Data Collection

- **Source**: Publicly available state Medicaid procurement documents (RFPs, proposals, contracts, scoring materials)
- **Time Period**: 2017-2024
- **Geographic Coverage**: 32 US states and District of Columbia
- **Collection Methods**: State procurement portals, Medicaid agency websites, FOIA requests
- **Total Pages Analyzed**: ~460,000 pages

## Files Included

### Primary Data Files

| File | Description | Format | Records |
|------|-------------|--------|---------|
| `thematic_claims.csv` | All extracted thematic accountability claims | CSV | 372,283 |
| `exhibit2_theme_taxonomy.csv` | Theme and subcategory distribution | CSV | 36 |
| `exhibit3_temporal_by_theme.csv` | Annual claim volumes by theme | CSV | 48 |
| `exhibit4_regional_themes.csv` | Regional distribution by theme | CSV | 30 |
| `exhibit5_rfp_mco_concordance.csv` | RFP-MCO concordance ratios by state | CSV | 179 |
| `document_inventory.csv` | Source document catalog | CSV | 265 |
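
After downloading, the record counts in the table above can serve as a quick integrity check. The following is a minimal sketch, not part of the dataset's own tooling. It assumes each CSV is UTF-8 encoded, has a single header row, and that the "Records" column counts data rows excluding that header. The helper names `count_records` and `check_downloads` are hypothetical.

```python
import csv
from pathlib import Path

# Record counts as listed in the Primary Data Files table.
EXPECTED_RECORDS = {
    "thematic_claims.csv": 372_283,
    "exhibit2_theme_taxonomy.csv": 36,
    "exhibit3_temporal_by_theme.csv": 48,
    "exhibit4_regional_themes.csv": 30,
    "exhibit5_rfp_mco_concordance.csv": 179,
    "document_inventory.csv": 265,
}


def count_records(path):
    """Count data rows in a CSV file, assuming the first row is a header."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for _ in reader)


def check_downloads(data_dir):
    """Return {filename: (found, expected)} for each listed file present in data_dir."""
    results = {}
    for name, expected in EXPECTED_RECORDS.items():
        path = Path(data_dir) / name
        if path.exists():
            results[name] = (count_records(path), expected)
    return results
```

Calling `check_downloads` on the download directory returns a pair of found and expected counts per file. A mismatch may simply mean one of the assumptions above does not hold for that file, so treat it as a prompt to consult `codebook.md` rather than as proof of a corrupted download.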

### Summary Files

| File | Description | Format |
|------|-------------|--------|
| `thematic_analysis_summary.json` | Aggregate statistics | JSON |
| `concordance_summary_by_theme.csv` | Concordance ratios by theme | CSV |
| `temporal_covid_comparison.csv` | Pre/post COVID comparison | CSV |

### Documentation

| File | Description |
|------|-------------|
| `codebook.md` | Variable definitions and coding schemes |
| `extraction_methods.md` | Technical documentation for RAG extraction pipeline |

## Thematic Categories

Claims were classified into six primary thematic domains:
1. **Chronic Disease Management** (28.9%, n=107,450): diabetes, hypertension, behavioral health, maternal health
2. **LTSS/Dual Eligibles** (18.0%, n=67,170): long-term services, nursing facilities, dual eligible populations
3. **Health Equity** (14.1%, n=52,528): racial/ethnic disparities, language access, disability, LGBTQ+ health
4. **Technology** (13.1%, n=48,754): telehealth, AI/predictive analytics, health information exchange
5. **Workforce** (13.1%, n=48,623): provider recruitment, cultural competency, community health workers
6. **SDOH** (12.8%, n=47,758): food insecurity, housing, transportation, social isolation
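
The percentages above follow directly from the per-theme counts. As a small worked check, the sketch below recomputes them from the counts listed here. It does not read the data files, and the dictionary name `THEME_COUNTS` is only illustrative.

```python
# Claim counts per theme, copied from the list above.
THEME_COUNTS = {
    "Chronic Disease Management": 107_450,
    "LTSS/Dual Eligibles": 67_170,
    "Health Equity": 52_528,
    "Technology": 48_754,
    "Workforce": 48_623,
    "SDOH": 47_758,
}

# The six themes should account for every claim in the dataset.
total = sum(THEME_COUNTS.values())

# Share of each theme as a percentage of all claims, to one decimal place.
shares = {theme: round(100 * n / total, 1) for theme, n in THEME_COUNTS.items()}

for theme, pct in shares.items():
    print(f"{theme}: {pct}%")
print(f"Total claims: {total:,}")
```

The counts sum to 372,283, which matches the record count of `thematic_claims.csv`.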

## Methodology

Claims were extracted using retrieval-augmented generation (RAG) with Claude Sonnet 4.5:
1. **Document chunking**: Vector embedding and semantic indexing
2. **Structured extraction**: JSON schema-constrained outputs with verbatim text
3. **Hallucination mitigation**: Temperature=0, source attribution, verbatim extraction
4. **Human validation**: Two-coder review with Cohen's κ=0.86
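
For readers unfamiliar with the agreement statistic in step 4, Cohen's κ compares the observed agreement between two coders with the agreement expected by chance. The sketch below is a generic implementation of the standard formula, not the authors' validation code. The function name `cohens_kappa` and the toy labels in the usage note are hypothetical and are not drawn from the dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders: (p_o - p_e) / (1 - p_e)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both coders.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of each coder's marginal rate.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)

    if p_e == 1:
        raise ValueError("Kappa is undefined when chance agreement equals 1.")
    return (p_o - p_e) / (1 - p_e)
```

For example, with toy labels `["x", "x", "y", "y"]` and `["x", "x", "y", "x"]`, observed agreement is 0.75 and chance agreement is 0.5, so the function returns 0.5.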

## Key Findings

- MCOs overemphasize technology (1.25×) and health equity (1.08×) relative to RFP requirements
- MCOs underemphasize chronic disease (0.54×) and workforce (0.70×) relative to RFP requirements
- Health equity claims increased 10.3-fold post-COVID (752 to 7,779)
- Regional variation: Midwest leads SDOH (47% food insecurity claims), West leads racial equity (40%)
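
The arithmetic behind these findings can be checked with a short sketch. It uses only the figures reported above and assumes that a ratio above 1 means MCO responses emphasize a theme more than the RFPs do, and below 1 less. The exact concordance formula is not restated in this README, so the helper names `fold_change` and `emphasis` are illustrative only.

```python
def fold_change(before, after):
    """Ratio of the later count to the earlier count."""
    if before <= 0:
        raise ValueError("The earlier count must be positive.")
    return after / before


def emphasis(ratio):
    """Label a concordance ratio relative to parity (1.0)."""
    if ratio > 1:
        return "overemphasized"
    if ratio < 1:
        return "underemphasized"
    return "matched"


# Ratios as reported in the findings above.
REPORTED_RATIOS = {
    "Technology": 1.25,
    "Health Equity": 1.08,
    "Chronic Disease Management": 0.54,
    "Workforce": 0.70,
}

# Health equity claims: 752 before versus 7,779 after, as reported above.
print(round(fold_change(752, 7779), 1))

for theme, ratio in REPORTED_RATIOS.items():
    print(f"{theme}: {ratio:.2f}x ({emphasis(ratio)})")
```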

## Limitations

- Document availability varies by state public records practices
- Some documents redacted or unavailable
- LLM extraction sensitivity was 0.89, so roughly 11% of claims may have gone undetected
- Analysis covers 2017-2024 and does not capture effects of the 2025 One Big Beautiful Bill Act (OBBBA)

## Ethical Considerations

This dataset is derived solely from publicly available government documents. No individually identifiable information is included. The study is IRB exempt under 45 CFR 46.104(d)(4).

## Funding

This research was funded by Waymark.

## Contact

Sanjay Basu, MD, PhD
Waymark / University of California, San Francisco
sanjay.basu@waymarkcare.org

## License

CC BY 4.0 - Creative Commons Attribution 4.0 International
